Introduction

In Mecklenburg County, the housing market shows good trend recent years. Housing Price Prediction is necessary and helpful. The hedonic model is a theoretical framework for predicting home prices by deconstructing house price into the value of its constituent parts.

Q: What makes this a difficult exercise? A: (We have a lot to say…) How to choose a good independent variable / We met a lot of errors when we transfer a numerical variable to a categorical variable/ Random error in every trunks…Fortunately! We finally did it.

Q: What is your overall modeling strategy? A: We first make OLS predictions according to the original dataset, and then filter the predictions according to the performance of the r square and MAPE to continuously optimize our model.

Q: Briefly summarize your results. A: Just like our team name – SUPER MODEL, which means our regression model will be a super model to predict the price!

Let’s just have a look! We try to make model accurate!

Data

mapTheme <- function(base_size = 50) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=1),
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.position = "right"
    )
}

Data Collecting

There are several way to collect data. Our main source is an official website from Mecklenburg County offering a large amount of data including traffic transit, parks, churchs, colleges, neighborhoods, and so on.

The second source is census tract data which can be accessed with tidycensus package in R.

studentdata <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/studentData.geojson") %>%
  st_transform('ESRI:102719')
## Reading layer `studentData' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/studentData.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 46183 features and 71 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.05626 ymin: 35.01345 xmax: -80.55941 ymax: 35.51012
## Geodetic CRS:  WGS 84
data <-
  studentdata[,c('stname','zipcode','price','sale_year','storyheigh','fullbaths','halfbaths','bedrooms','units','yearbuilt','heatedarea','heatedfuel','foundation','bldggrade','shape_Leng','shape_Area','musaID','toPredict','geometry')]

collegedata <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/colleges/Colleges.shp") %>%
  st_transform('ESRI:102719')
## Reading layer `Colleges' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/colleges/Colleges.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 25 features and 11 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1407818 ymin: 481350 xmax: 1493233 ymax: 642582.4
## Projected CRS: NAD83 / North Carolina (ftUS)
parkdata <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/park_locations/Park_Locations.shp") %>%
  st_transform('ESRI:102719') 
## Reading layer `Park_Locations' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/park_locations/Park_Locations.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 351 features and 97 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1394440 ymin: 467223.2 xmax: 1524300 ymax: 644298.8
## Projected CRS: NAD83 / North Carolina (ftUS)
churchdata <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/churches/Churches.shp") %>%
  st_transform('ESRI:102719') 
## Reading layer `Churches' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/churches/Churches.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 880 features and 44 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1392478 ymin: 466634 xmax: 1528037 ymax: 644203
## Projected CRS: NAD83 / North Carolina (ftUS)
tfdata <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/ncdot_trafficestimates/NCDOT_TrafficEstimates.shp") %>%
  st_transform('ESRI:102719') 
## Reading layer `NCDOT_TrafficEstimates' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/ncdot_trafficestimates/NCDOT_TrafficEstimates.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 2064 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1373587 ymin: 448274.1 xmax: 1553672 ymax: 667493
## Projected CRS: NAD83 / North Carolina (ftUS)
bd <- 
  st_read("/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/npa/npa.shp") %>%
  st_transform('ESRI:102719')
## Reading layer `npa' from data source 
##   `/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/data/npa/npa.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 458 features and 4 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1384164 ymin: 460652.7 xmax: 1536927 ymax: 647732.9
## Projected CRS: NAD83 / North Carolina (ftUS)
college <- select(collegedata, SchoolName)
church <- select(churchdata, address)
park <- select(parkdata, prkname)
transit <- select(tfdata, county, location)
transit <- transit[transit$county=="MECKLENBURG",]
st_c <- st_coordinates

bd.sf <-
  bd %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102719') %>%
    distinct()

data.sf <- 
  data %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102719') %>%
    distinct()

college.sf <-
  college %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
    st_transform('ESRI:102719') %>%
    distinct()

church.sf <-
  church %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
    st_transform('ESRI:102719') %>%
    distinct()

park.sf <-
  park %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
    st_transform('ESRI:102719') %>%
    distinct()

transit.sf <-
  transit %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
    st_transform('ESRI:102719') %>%
    distinct()

Data Wrangling

In our model, the dependent variable is house sale price.

In addition, there are 3 main factors:

1.Interal characteristics
2.Amenities of decision factor
3.Spatial structure

First thing we need to do is to score yearbuilt with quantile.

divyear <-
  data.sf$yearbuilt
pctdivyear <-
  quantile(divyear,c(0,0.25,0.50,0.75,1))

data.sf$year = 0
data.sf$year[which(data.sf$yearbuilt < 1979 )] =1
data.sf$year[which((data.sf$yearbuilt >= 1979)&(data.sf$yearbuilt < 1999))] =2
data.sf$year[which((data.sf$yearbuilt >= 1999)&(data.sf$yearbuilt < 2015))] =3
data.sf$year[which((data.sf$yearbuilt >= 2015)&(data.sf$yearbuilt < 2022))] =4
data.sf <- select(data.sf, -c(yearbuilt))

Feature Engineering

Second, we need to calculates the ‘average nearest neighbor distance’ from each home sale to its k nearest neighbor transit stop, college, church and park. A generalizable model is also one that predicts with comparable accuracy across different groups. We predicted each of these features are correlated with house price.

bd.sf <-bd.sf[,c('id')]
alldata <- st_join(data.sf, bd.sf, join = st_intersects)
alldata$id <- as.factor(alldata$id)
st_c <- st_coordinates

data.sf <-
  data.sf %>% 
    mutate(
      college_nn1 = nn_function(st_c(data.sf), st_c(college.sf), 1),
      church_nn1 = nn_function(st_c(data.sf), st_c(church.sf), 1),
      park_nn2 = nn_function(st_c(data.sf), st_c(park.sf), 2),
      transit_nn1 = nn_function(st_c(data.sf), st_c(transit.sf), 1),
      transit_nn2 = nn_function(st_c(data.sf), st_c(transit.sf), 2),
      transit_nn3 = nn_function(st_c(data.sf), st_c(transit.sf), 3),
      transit_nn4 = nn_function(st_c(data.sf), st_c(transit.sf), 4),
      transit_nn5 = nn_function(st_c(data.sf), st_c(transit.sf), 5),
      ) 
map2 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(college.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 bins = 50, geom = 'polygon') +
  scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
  scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
  labs(title = "Density of College") +
  mapTheme()
map3 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(transit.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 bins = 50, geom = 'polygon') +
  scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
  scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
  labs(title = "Density of Transit") +
  mapTheme()
map4 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(park.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
  scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
  labs(title = "Density of Park") +
  mapTheme()
map5 <- ggplot() + geom_sf(data = bd.sf, fill = "grey85", color = "white", size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(church.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#5252B8", high = "#D0A561", name = "Density") +
  scale_alpha(range = c(0.00, 0.3), guide = FALSE) +
  labs(title = "Density of Church") +
mapTheme()
ggarrange(map2, map3, map4, map5, ncol = 2, nrow = 2)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
## It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.

In order to ensure that the model results are more fit, we did data type transformation here, such as character type “storyheigh”, “heatedfuel”, “foundation”,“bldggrade” etc. We changed their type into categorical type.

data.sf$year <- as.factor(data.sf$year)
data.sf$storyheigh <- as.factor(data.sf$storyheigh)
data.sf$heatedfuel <- as.factor(data.sf$heatedfuel)
data.sf$foundation <- as.factor(data.sf$foundation)
data.sf$bldggrade <- as.factor(data.sf$bldggrade)

Getting Census Data

census_api_key("a9e713a06a0a0f8ec8531e047c9d01e7d9f507d9", overwrite = TRUE, install= TRUE)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "a9e713a06a0a0f8ec8531e047c9d01e7d9f507d9"
census_vars <- c( "B01001_001E", # ACS total Pop estimate
                  "B25002_001E", # Estimate of total housing units
                  "B25002_003E", # Number of vacant housing units
                  "B19013_001E", # Median HH Income ($)
                  "B02001_002E", # People describing themselves as "white alone"
                  "B06009_005E", # TotalBachelor's degree
                  "B06009_006E", # Total graduate or professional degree
                  "B06012_002E", # Number of below 100 percent of the poverty level
                  "B23025_002E", # Number of people in labor force
                  "B23025_005E") # Number of people in labor force unemployed
                  

# get access to census data of 2020
acsTracts20 <- get_acs(geography = "tract",
                         year = 2020, 
                         variables = census_vars, 
                         geometry = TRUE, 
                         state = 37, 
                         county = 119,
                         output = "wide") %>%
  dplyr::select(NAME, NAME,all_of(census_vars)) %>%
  rename (pop = B01001_001E,
          hu = B25002_001E,
          vacant = B25002_003E,
          med_hh_income = B19013_001E,
          white = B02001_002E,
          bachdeg = B06009_005E,
          graddeg = B06009_006E,
          belpov100 = B06012_002E,
          labor_f = B23025_002E,
          labor_f_unemployed = B23025_005E) %>%
  mutate(pctvacant = vacant/hu,
         pctwhite = white/pop,
         pctunemployed = labor_f_unemployed/labor_f,
         pctbach = bachdeg/pop,
         pctgrad = graddeg/pop,
         raceContext = ifelse(pctwhite > .5, "Majority White", "Majority Non-White")
         ) %>%
    dplyr::select(-vacant,-hu,-white,-labor_f_unemployed,-labor_f,-bachdeg,-graddeg) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
    st_transform('ESRI:102719') %>%
    distinct()
## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%

Add census data to our dataset

alldata <- st_join(data.sf, acsTracts20, join = st_intersects) %>% na.omit()

alldata.sf <- 
  alldata  %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102179')

Summary Statistics

We split the dataset into a training and test set by select “MODELLING” and “CHALLENGE”, then summarize the data with a polished table of our training data set by stargazer .

modelling  <-  alldata.sf[alldata.sf$toPredict  == "MODELLING" ,]
challenge <-  alldata.sf[alldata.sf$toPredict  == "CHALLENGE" ,]
md <- st_drop_geometry(modelling)

stargazer(md, 
          type="text", 
          title = "Main Categories of Decision Factor: Internal Characteristics", 
          keep=c("fullbaths","halfbaths","bedrooms","units","heatedarea","shape_Leng","shape_Area"),
          covariate.labels = c("Number of Fullbaths", "Number of Halfbaths", "Number of Bedrooms", "Number of Unts", "Heated Area", "Length of Shape", "Area of Shape")) 
## 
## Main Categories of Decision Factor: Internal Characteristics
## ========================================================================
## Statistic             N       Mean     St. Dev.     Min         Max     
## ------------------------------------------------------------------------
## Number of Fullbaths 46,009   2.282      0.825        0           9      
## Number of Halfbaths 46,009   0.596      0.528        0           4      
## Number of Bedrooms  46,009   3.510      0.942        0          65      
## Number of Unts      46,009   0.981      0.971        0          205     
## Heated Area         46,009 2,359.218  1,060.543    0.000    14,710.000  
## Length of Shape     46,009  504.231    253.156    152.910   10,798.810  
## Area of Shape       46,009 15,876.130 35,083.170 1,139.637 3,486,865.000
## ------------------------------------------------------------------------
stargazer(md, 
          type="text", 
          title = "Main Categories of Decision Factor: Amenities and public Services", 
          keep=c("college_nn1","church_nn1","park_nn2","transit_nn3"),
          covariate.labels = c("Average Distance Between Each House and 1 Nearest College", "Average Distance Between Each House 1 Nearest Church", "Average Distance Between Each House 2 Nearest Parks", "Average Distance Between Each House 3 Nearest Transit Stops"))
## 
## Main Categories of Decision Factor: Amenities and public Services
## ==========================================================================================================
## Statistic                                                     N       Mean    St. Dev.    Min      Max    
## ----------------------------------------------------------------------------------------------------------
## Average Distance Between Each House and 1 Nearest College   46,009 16,774.630 8,982.394 459.933 54,413.280
## Average Distance Between Each House 1 Nearest Church        46,009 2,932.255  2,097.341 17.465  14,635.710
## Average Distance Between Each House 2 Nearest Parks         46,009 4,941.578  2,574.794 205.557 16,614.240
## Average Distance Between Each House 3 Nearest Transit Stops 46,009 3,013.649  1,405.207 230.756 12,735.250
## ----------------------------------------------------------------------------------------------------------
stargazer(md, 
          type="text", 
          title = "Main Categories of Decision Factor: Spatial Structure", 
          keep=c("pop","med_hh_income","belpov100 ","pctvacant","pctunemployed","pctbach","pctgrad"),
          covariate.labels = c("Total Population 2020", "Median Household Income", "Number of Below 100% of the Poverty Level", "Percent of Vacant Housing Units","Percent of White People", "Percent of Unemployed Labor Force", "Percent of Bachelor's Degree", "Percent of Graduate"))
## 
## Main Categories of Decision Factor: Spatial Structure
## =====================================================================================
## Statistic                                   N       Mean     St. Dev.   Min     Max  
## -------------------------------------------------------------------------------------
## Total Population 2020                     46,009 4,065.882  1,310.872   790    7,703 
## Median Household Income                   46,009 86,187.900 37,427.110 17,685 238,750
## Number of Below 100% of the Poverty Level 46,009   0.058      0.043    0.000   0.293 
## Percent of Vacant Housing Units           46,009   0.046      0.039    0.000   0.225 
## Percent of White People                   46,009   0.214      0.089    0.018   0.460 
## Percent of Unemployed Labor Force         46,009   0.112      0.065    0.004   0.322 
## -------------------------------------------------------------------------------------
facvar <-
  select_if(st_drop_geometry(modelling), is.factor) %>% na.omit()
  
numvar <- 
  select_if(st_drop_geometry(modelling), is.numeric) %>% na.omit()

We use bar plots to visualize our categorical data. In this way, we can see the specific categories of categorical data and also their average price distribution.

catbar <- st_drop_geometry(modelling)
catbar %>% 
  dplyr::select(price, storyheigh, fullbaths, halfbaths, bedrooms, units, heatedfuel, foundation, bldggrade, year) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_bar(width = 0.5, position = "dodge", stat = "summary", fun.y = "mean", color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
     facet_wrap(~Variable, ncol = 5, scales = "free") +
     labs(title = "Price as a function of categorical variables", y = "Mean Price") +
    theme_minimal() + 
  theme(axis.text.x = element_text(size = 5, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 6, hjust = 1)) 
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

In the histograms of numeric variables, it can be concluded the distribution of the number of the variables. These sample data are relatively concentrated.

# plot histograms of numeric variables
ggplot(gather(numvar), aes(value)) +
  geom_histogram(color = "#6A6ACD", fill = "#d5d5e6", size = 0.4, bins = 10) +
  facet_wrap(~key, scales = 'free_x') +
      theme_minimal() + 
  scale_color_brewer(palette="Paired") +
  theme(axis.text.x = element_text(size = 4, hjust = 0.5), 
        axis.text.y = element_text(size = 5, hjust = 1))

Correlation

Correlation Matrix

A correlation matrix gives us the pairwise correlation of each set of features in our data. We add the each predictors’R square of correlation matrix in the plot. Our analysis of pairwise correlations between these predictors helps us assess the degree of association between these predictors.

ggcorrplot(
  hc.order = TRUE, 
  outline.color = "white",
  round(cor(numvar), 1), 
  p.mat = cor_pmat(numvar),
  colors = c("#5252B8", "white", "#D0A561"),
  show.diag = TRUE,
  lab = TRUE,
  lab_size = 1.5,
  lab_col = "grey40",
  insig = "blank",
  legend.title = "Correlation",
    ggtheme = ggplot2::theme_gray) + 
    labs(title = "Correlation Matrix Across Numeric Variables") +
  theme(plot.title = element_text(size = 15, hjust = 1),
        axis.text.x = element_text(size = 6, hjust = 1),
        axis.text.y = element_text(size = 6, hjust = 1)
  )

cormax <- cor(numvar)
cormax.df <- as.data.frame(t(cormax))
cormax.df
##                      price    sale_year   fullbaths    halfbaths     bedrooms
## price          1.000000000  0.079026192  0.34011212  0.131834671  0.235757783
## sale_year      0.079026192  1.000000000 -0.07128846 -0.022490794 -0.062591236
## fullbaths      0.340112116 -0.071288464  1.00000000  0.086990730  0.590642509
## halfbaths      0.131834671 -0.022490794  0.08699073  1.000000000  0.219731603
## bedrooms       0.235757783 -0.062591236  0.59064251  0.219731603  1.000000000
## units          0.564840328 -0.030897889 -0.01128456 -0.011204918  0.038262583
## heatedarea     0.410945955 -0.060922192  0.78107220  0.363058899  0.599006145
## shape_Leng     0.207822855  0.005259271  0.10991491 -0.020177353  0.064277882
## shape_Area     0.241012232  0.011950949  0.05509760 -0.009845004  0.034121234
## musaID         0.070050667 -0.001842199  0.09194938  0.003231425  0.059194650
## college_nn1   -0.045648779 -0.002293618  0.12748936  0.052947081  0.087449983
## church_nn1     0.039079512 -0.025614290  0.25297012  0.118962184  0.195238872
## park_nn2       0.001701844 -0.008252172  0.14337608  0.084092197  0.122577698
## transit_nn1   -0.017128469 -0.015827092  0.13570166  0.085266889  0.118865871
## transit_nn2   -0.008168584 -0.021121464  0.16191733  0.097075909  0.133685351
## transit_nn3   -0.004303770 -0.023131241  0.18143453  0.106947987  0.147305722
## transit_nn4   -0.004365708 -0.024820586  0.19489647  0.110088164  0.157061637
## transit_nn5   -0.003995815 -0.025947369  0.20479166  0.113094534  0.164098332
## pop           -0.046395963  0.002708378 -0.01539407  0.046484323 -0.008843498
## med_hh_income  0.274841821 -0.046078510  0.43891142  0.194989126  0.320192021
## belpov100     -0.154080225  0.030067766 -0.31532345 -0.136016727 -0.254836472
## pctvacant      0.062512356  0.012222420 -0.09642390 -0.099033912 -0.108223633
## pctwhite       0.260629526 -0.045237704  0.36817661  0.121605121  0.257981171
## pctunemployed -0.096057104  0.026845805 -0.13854434 -0.087456493 -0.097317435
## pctbach        0.256380828 -0.044373549  0.36170220  0.145187909  0.248258960
## pctgrad        0.293630205 -0.041623920  0.37459292  0.132508751  0.259539491
##                       units  heatedarea   shape_Leng   shape_Area        musaID
## price          0.5648403283  0.41094596  0.207822855  0.241012232  0.0700506669
## sale_year     -0.0308978894 -0.06092219  0.005259271  0.011950949 -0.0018421989
## fullbaths     -0.0112845578  0.78107220  0.109914908  0.055097601  0.0919493834
## halfbaths     -0.0112049183  0.36305890 -0.020177353 -0.009845004  0.0032314254
## bedrooms       0.0382625827  0.59900615  0.064277882  0.034121234  0.0591946497
## units          1.0000000000 -0.01647143  0.072437883  0.131219381 -0.0002289342
## heatedarea    -0.0164714296  1.00000000  0.196317188  0.109066069  0.1082989004
## shape_Leng     0.0724378835  0.19631719  1.000000000  0.816170993  0.0764059148
## shape_Area     0.1312193814  0.10906607  0.816170993  1.000000000  0.0287585953
## musaID        -0.0002289342  0.10829890  0.076405915  0.028758595  1.0000000000
## college_nn1   -0.0182317589  0.13818948  0.092654911  0.062800802  0.1194380659
## church_nn1    -0.0171479846  0.29263009  0.049713334  0.037213723  0.1542252999
## park_nn2      -0.0096192279  0.15095998  0.043414684  0.026102555  0.2755920966
## transit_nn1   -0.0162114019  0.16177056  0.050270320  0.037905608  0.0339315541
## transit_nn2   -0.0156012509  0.19412943  0.058648113  0.044454281  0.0136889346
## transit_nn3   -0.0172987008  0.21778912  0.064555784  0.047479541  0.0026889617
## transit_nn4   -0.0180069778  0.23231658  0.067353529  0.048549636  0.0085976795
## transit_nn5   -0.0184545888  0.24259211  0.069086938  0.049748763  0.0145862701
## pop            0.0076019876 -0.01807662 -0.011582909 -0.006514217  0.0822508389
## med_hh_income -0.0131669154  0.53445562  0.115551231  0.055263270  0.1802322819
## belpov100      0.0119076119 -0.37829025 -0.079051457 -0.047764182 -0.0620536654
## pctvacant      0.0199782838 -0.10234441 -0.004520817 -0.016163991 -0.1381158792
## pctwhite      -0.0020504712  0.43902514  0.145459426  0.069452638  0.1795477574
## pctunemployed  0.0097445317 -0.18238950 -0.042951384 -0.024350076 -0.0673455585
## pctbach       -0.0075398136  0.44153918  0.097848737  0.042869097  0.2087682764
## pctgrad       -0.0052532860  0.45482504  0.088959048  0.034202585  0.1969485604
##                college_nn1  church_nn1     park_nn2 transit_nn1  transit_nn2
## price         -0.045648779  0.03907951  0.001701844 -0.01712847 -0.008168584
## sale_year     -0.002293618 -0.02561429 -0.008252172 -0.01582709 -0.021121464
## fullbaths      0.127489361  0.25297012  0.143376079  0.13570166  0.161917333
## halfbaths      0.052947081  0.11896218  0.084092197  0.08526689  0.097075909
## bedrooms       0.087449983  0.19523887  0.122577698  0.11886587  0.133685351
## units         -0.018231759 -0.01714798 -0.009619228 -0.01621140 -0.015601251
## heatedarea     0.138189476  0.29263009  0.150959982  0.16177056  0.194129427
## shape_Leng     0.092654911  0.04971333  0.043414684  0.05027032  0.058648113
## shape_Area     0.062800802  0.03721372  0.026102555  0.03790561  0.044454281
## musaID         0.119438066  0.15422530  0.275592097  0.03393155  0.013688935
## college_nn1    1.000000000  0.40296112  0.348551657  0.32228523  0.391227215
## church_nn1     0.402961117  1.00000000  0.368271966  0.44614150  0.502067817
## park_nn2       0.348551657  0.36827197  1.000000000  0.31450100  0.326962123
## transit_nn1    0.322285227  0.44614150  0.314500997  1.00000000  0.942931566
## transit_nn2    0.391227215  0.50206782  0.326962123  0.94293157  1.000000000
## transit_nn3    0.437056551  0.55003199  0.338455484  0.88237426  0.976387162
## transit_nn4    0.482845769  0.59051102  0.359278773  0.83596005  0.942056640
## transit_nn5    0.510278926  0.61604949  0.373174007  0.79660854  0.907668470
## pop            0.226750937  0.08218778  0.131024135  0.08929851  0.095035719
## med_hh_income  0.161995791  0.35632855  0.100781382  0.13584099  0.166317178
## belpov100     -0.089592109 -0.25445630 -0.088033206 -0.12074054 -0.140075049
## pctvacant     -0.194911544 -0.21794452 -0.223557436 -0.23140926 -0.238346623
## pctwhite       0.029457238  0.24038648  0.023607898  0.06327968  0.062909534
## pctunemployed -0.063076678 -0.10567463  0.010623030 -0.02914095 -0.033436203
## pctbach        0.027434630  0.20545163 -0.006980639  0.06330046  0.061849261
## pctgrad       -0.085060136  0.14551632 -0.002126861 -0.03090743 -0.037005764
##                transit_nn3  transit_nn4  transit_nn5          pop med_hh_income
## price         -0.004303770 -0.004365708 -0.003995815 -0.046395963    0.27484182
## sale_year     -0.023131241 -0.024820586 -0.025947369  0.002708378   -0.04607851
## fullbaths      0.181434533  0.194896465  0.204791659 -0.015394071    0.43891142
## halfbaths      0.106947987  0.110088164  0.113094534  0.046484323    0.19498913
## bedrooms       0.147305722  0.157061637  0.164098332 -0.008843498    0.32019202
## units         -0.017298701 -0.018006978 -0.018454589  0.007601988   -0.01316692
## heatedarea     0.217789117  0.232316583  0.242592109 -0.018076621    0.53445562
## shape_Leng     0.064555784  0.067353529  0.069086938 -0.011582909    0.11555123
## shape_Area     0.047479541  0.048549636  0.049748763 -0.006514217    0.05526327
## musaID         0.002688962  0.008597679  0.014586270  0.082250839    0.18023228
## college_nn1    0.437056551  0.482845769  0.510278926  0.226750937    0.16199579
## church_nn1     0.550031988  0.590511022  0.616049493  0.082187780    0.35632855
## park_nn2       0.338455484  0.359278773  0.373174007  0.131024135    0.10078138
## transit_nn1    0.882374263  0.835960052  0.796608539  0.089298511    0.13584099
## transit_nn2    0.976387162  0.942056640  0.907668470  0.095035719    0.16631718
## transit_nn3    1.000000000  0.987231698  0.965308251  0.095733820    0.20131727
## transit_nn4    0.987231698  1.000000000  0.993062192  0.097792924    0.22621194
## transit_nn5    0.965308251  0.993062192  1.000000000  0.100241032    0.24530860
## pop            0.095733820  0.097792924  0.100241032  1.000000000    0.01926385
## med_hh_income  0.201317266  0.226211939  0.245308599  0.019263850    1.00000000
## belpov100     -0.159581418 -0.171723114 -0.179943665  0.310517764   -0.59187874
## pctvacant     -0.253638609 -0.267612056 -0.278880197 -0.172003727   -0.18708107
## pctwhite       0.075385813  0.082785878  0.089185935 -0.134766423    0.67233000
## pctunemployed -0.044328613 -0.048908783 -0.052369340  0.023023309   -0.33633678
## pctbach        0.075209724  0.086046069  0.094594506 -0.133811823    0.71181624
## pctgrad       -0.030320478 -0.025687105 -0.020735949 -0.132581658    0.73091306
##                 belpov100    pctvacant     pctwhite pctunemployed      pctbach
## price         -0.15408022  0.062512356  0.260629526  -0.096057104  0.256380828
## sale_year      0.03006777  0.012222420 -0.045237704   0.026845805 -0.044373549
## fullbaths     -0.31532345 -0.096423900  0.368176612  -0.138544338  0.361702203
## halfbaths     -0.13601673 -0.099033912  0.121605121  -0.087456493  0.145187909
## bedrooms      -0.25483647 -0.108223633  0.257981171  -0.097317435  0.248258960
## units          0.01190761  0.019978284 -0.002050471   0.009744532 -0.007539814
## heatedarea    -0.37829025 -0.102344405  0.439025142  -0.182389501  0.441539185
## shape_Leng    -0.07905146 -0.004520817  0.145459426  -0.042951384  0.097848737
## shape_Area    -0.04776418 -0.016163991  0.069452638  -0.024350076  0.042869097
## musaID        -0.06205367 -0.138115879  0.179547757  -0.067345558  0.208768276
## college_nn1   -0.08959211 -0.194911544  0.029457238  -0.063076678  0.027434630
## church_nn1    -0.25445630 -0.217944517  0.240386479  -0.105674629  0.205451632
## park_nn2      -0.08803321 -0.223557436  0.023607898   0.010623030 -0.006980639
## transit_nn1   -0.12074054 -0.231409258  0.063279682  -0.029140948  0.063300460
## transit_nn2   -0.14007505 -0.238346623  0.062909534  -0.033436203  0.061849261
## transit_nn3   -0.15958142 -0.253638609  0.075385813  -0.044328613  0.075209724
## transit_nn4   -0.17172311 -0.267612056  0.082785878  -0.048908783  0.086046069
## transit_nn5   -0.17994367 -0.278880197  0.089185935  -0.052369340  0.094594506
## pop            0.31051776 -0.172003727 -0.134766423   0.023023309 -0.133811823
## med_hh_income -0.59187874 -0.187081070  0.672329999  -0.336336777  0.711816239
## belpov100      1.00000000  0.151029056 -0.542750934   0.314907583 -0.606753394
## pctvacant      0.15102906  1.000000000 -0.106457096   0.008780449 -0.034536516
## pctwhite      -0.54275093 -0.106457096  1.000000000  -0.319582238  0.742082041
## pctunemployed  0.31490758  0.008780449 -0.319582238   1.000000000 -0.422797700
## pctbach       -0.60675339 -0.034536516  0.742082041  -0.422797700  1.000000000
## pctgrad       -0.56330874 -0.007865324  0.683025033  -0.408132927  0.750232475
##                    pctgrad
## price          0.293630205
## sale_year     -0.041623920
## fullbaths      0.374592925
## halfbaths      0.132508751
## bedrooms       0.259539491
## units         -0.005253286
## heatedarea     0.454825044
## shape_Leng     0.088959048
## shape_Area     0.034202585
## musaID         0.196948560
## college_nn1   -0.085060136
## church_nn1     0.145516320
## park_nn2      -0.002126861
## transit_nn1   -0.030907432
## transit_nn2   -0.037005764
## transit_nn3   -0.030320478
## transit_nn4   -0.025687105
## transit_nn5   -0.020735949
## pop           -0.132581658
## med_hh_income  0.730913061
## belpov100     -0.563308743
## pctvacant     -0.007865324
## pctwhite       0.683025033
## pctunemployed -0.408132927
## pctbach        0.750232475
## pctgrad        1.000000000

At this time, we are going to delete collinear independent(R srquare>0.75 or <-0.75).

We choose the length of shape, distance to nearest 3 transit stops, percent of graduate, and move the shape area, percent bachelor degree and distance to nearest others transit stops out at the same time.

cormaxsel <-
  (cormax.df < 0.75 &cormax.df > -0.75 )*cormax.df

cormaxsel2 <- 
  which(t(t(cormaxsel) ==0), arr.ind=TRUE)
cormaxsel2[,"row"] <- rownames(cormaxsel)[as.numeric(cormaxsel2[,"row"])]
cormaxsel2[,"col"] <- colnames(cormaxsel)[as.numeric(cormaxsel2[,"col"])]


for (i in(1:nrow(cormaxsel2))){
  if(cormaxsel2[i,1] != cormaxsel2[i,2]){
  print(cormaxsel2[i,])
  }
}
##          row          col 
## "heatedarea"  "fullbaths" 
##          row          col 
##  "fullbaths" "heatedarea" 
##          row          col 
## "shape_Area" "shape_Leng" 
##          row          col 
## "shape_Leng" "shape_Area" 
##           row           col 
## "transit_nn2" "transit_nn1" 
##           row           col 
## "transit_nn3" "transit_nn1" 
##           row           col 
## "transit_nn4" "transit_nn1" 
##           row           col 
## "transit_nn5" "transit_nn1" 
##           row           col 
## "transit_nn1" "transit_nn2" 
##           row           col 
## "transit_nn3" "transit_nn2" 
##           row           col 
## "transit_nn4" "transit_nn2" 
##           row           col 
## "transit_nn5" "transit_nn2" 
##           row           col 
## "transit_nn1" "transit_nn3" 
##           row           col 
## "transit_nn2" "transit_nn3" 
##           row           col 
## "transit_nn4" "transit_nn3" 
##           row           col 
## "transit_nn5" "transit_nn3" 
##           row           col 
## "transit_nn1" "transit_nn4" 
##           row           col 
## "transit_nn2" "transit_nn4" 
##           row           col 
## "transit_nn3" "transit_nn4" 
##           row           col 
## "transit_nn5" "transit_nn4" 
##           row           col 
## "transit_nn1" "transit_nn5" 
##           row           col 
## "transit_nn2" "transit_nn5" 
##           row           col 
## "transit_nn3" "transit_nn5" 
##           row           col 
## "transit_nn4" "transit_nn5" 
##       row       col 
## "pctgrad" "pctbach" 
##       row       col 
## "pctbach" "pctgrad"

Analyzing Associations

We chose four factors that we thought were most relevant to house price, but it appears that school has little effect on home prices. But I wonder what will change in the multi-factor model afterwards.

numplot1 <- st_drop_geometry(numvar)
numplot1 %>%
  dplyr::select(price, pctgrad, park_nn2, med_hh_income, pctvacant) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
  geom_point(size = .2,alpha = .1, colour = "#2d2d42") + 
  geom_smooth(method = "lm", se=F, colour = "#dea243") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price As a Function of Continuous Variables") +
     theme_nice()
## `geom_smooth()` using formula 'y ~ x'

numplot2 <- st_drop_geometry(numvar)
numplot2 %>%
  dplyr::select(price, transit_nn1, transit_nn2, transit_nn3, transit_nn4, transit_nn5) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
  geom_point(size = .2,alpha = .05, colour = "#2d2d42") + 
  geom_smooth(method = "lm", se=F, colour = "#dea243") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
  theme(axis.text.x = element_text(size = 5, hjust = 1), 
        axis.text.y = element_text(size = 2, hjust = 1)) +
  theme_nice()
## `geom_smooth()` using formula 'y ~ x'

Mapping

Sale Price Map

Below is a mapping of home sales prices in Mecklenburge. The yellow part means high price, and the purple part means low price. So the high price area mainly concentrate in the north and the south of the Mecklenburge.

ggplot() +
  geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
  geom_sf(data = data.sf, aes(colour = q5(price)), 
          show.legend = "point", size = .1, alpha = .1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(data,"price"),
                      name="Home Sale Price in US Dollars\n(Quintile Breaks)") +
  labs(title="Home Sale Price in Mecklenburge, NC") +
  theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()

3 Interesting Maps

Below are three mappings of interesting independent variables. Actually, the distribution are kind of similar in spatial area. But the density in some areas are different. The distribution of average of income are concentrated. But the heated area has some high values in the north of the city.

ggplot() +
  geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
  geom_sf(data = modelling, aes(colour = q5(heatedarea)), 
          show.legend = "points", size = .1, alpha = .1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(modelling,"heatedarea"),
                      name="Heated Area\n(QuintileBreaks)") +
  labs(title="Heated Area Distribution") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

ggplot() +
  geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
  geom_sf(data = modelling, aes(colour = q5(pop)), 
          show.legend = "points", size = .1, alpha = .1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(modelling,"pop"),
                      name="Population\n(QuintileBreaks)") +
  labs(title="Population Distribution") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

ggplot() +
  geom_sf(data = bd, fill = "grey25", color = "grey50", size = 0.2) +
  geom_sf(data = modelling, aes(colour = q5(med_hh_income)), 
          show.legend = "points", size = .1, alpha = .1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(modelling,"med_hh_income"),
                      name="Income in US Dollars\n(QuintileBreaks)") +
  labs(title="Average  Income of Household Each House") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

Methodology

Machine Learning is awesome and powerful, but it can also appear incredibly complicated. The process typically has five stages: namely data wrangling, exploratory analysis, feature engineering, feature selection as well as model estimation and validation.

The first method of evaluating the model: the accuracy. The second method of evaluating models: the generalizability. What is Generalizability? The results are widely applicable to different types of data.

In order to improve the accuracy, that is, to reduce the difference between the predicted and observed house prices, we do some data cleaning, including removing impossible data. In addition, we still need to see the correlation between predictors by using ggcorrplot() to plot a correlation matrix.

We will use the createDataPartition function to randomly split the current data into 70% of the training and 30% of the test data set. To train the model, We created a training data set (data set with known sale prices) which included 45581 observations to “train” the regression model to predict the home sales price in the test data set (the data set of homes with unknown, or 0, prices).

Secondly, the errors! We use the MAE, MAPE, of a randomly selected test set to see how much our predictions differ from the observed home selling prices, and to see how good the model is. The smaller the MAPE, the better the model. We useabs()function to calculate the error, MAE and MAPE, use mean() function to calculate the average of MAE and MAPE.

Thirdly, In evaluating the predictive ability of our model, We got two separate approaches. The first approach was “In-sample training”, in which I divided the modelling data-set into two groups(M.test and M.train), and used one of the groups to predict the prices in the other. The second approach was a K-fold cross-validation which randomly divids the training set into ten equal “folds”, and one by one predicts prices for each of the folds using the remaining nine folds. K-fold cross-validation is a good way to measure the performance of a model on new data.

Next, Moran’s I is a method to examine whether the model was sufficiently capturing spatial structure of prices. Meanwhile, it can indicate some spatial dynamic that was not accounted for in the mode.

Finding the residuals to be spatially correlated, we use the partitioned data from the census data as the neighborhood data. We added a new spatial feature to the model, hoping to balance the spatial correlation in the residuals. Comparing the MAPE of the model with the two predictions, the MAPE of the new model is reduced. This indicates that this spatial feature has an effect in the model.

Finally, we used generalizability to test the applicability of the model. Using different partitioning methods, we verify that the model can predict accurately. Comparing the change of MAPE values of these two mods, the smaller the change value is, the more applicable the model is.

Results

First OLS Regression Model

Thus, we randomly select 70% of the home sale price observations from the dataset as the training set, and 30% of the dataset as the testing dataset bycreateDataPartition.

modelling_sub_200k <- modelling %>% 
filter(price <= 2000000)

inTrain <- createDataPartition(
              y = paste(modelling$NAME,modelling$sale_year,modelling$storyheigh,modelling$fullbaths,modelling$halfbaths,modelling$bedrooms,modelling$units,modelling$heatedarea,modelling$heatedfuel,modelling$foundation,modelling$bldggrade,modelling$year), 
              p = .70, list = FALSE)
## Warning in createDataPartition(y = paste(modelling$NAME, modelling$sale_year, :
## Some classes have a single record ( Census Tract 1.02, Mecklenburg County, North
## Carolina 2021 1 STORY 1 0 2 1 913 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 2 1 1034 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 2 1 1092 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 2 1 868 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 2 1 972 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1040 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1106 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1114 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1180 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1248 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1263 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1272 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1324 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 0 3 1 1352 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 0
## 3 1 1446 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 1 0 3 1 1494 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 1 1 2 1 1717 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 1 1
## 2 1 1800 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 2 1 1170 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 2 1 1287 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 2 2 1776 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 3 1 1200 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1248 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 3 1 1359 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 2 0 3 1 1514 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1535 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0
## 3 1 1644 ELECTRIC CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County,
## North Carolina 2020 1 STORY 2 0 3 1 1711 GAS CRAWL SPACE GOOD 1, Census Tract
## 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 1732 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1
## 1764 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County, North Carolina
## 2020 1 STORY 2 0 3 1 2243 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg
## County, North Carolina 2020 1 STORY 2 0 3 1 2300 GAS CRAWL SPACE GOOD 1, Census
## Tract 10, Mecklenburg County, North Carolina 2020 1 STORY 2 0 3 1 2410 GAS CRAWL
## SPACE GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1 STORY
## 3 0 3 1 1530 GAS BASEMENT AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1 STORY 3 0 3 1 1839 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1 STORY 3 0 3 1 2155 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 1 0
## 2 1 1251 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 2 0 3 1 1478 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 2 0 3 1 1812 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY
## 2 0 3 1 2416 GAS CRAWL SPACE GOOD 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 2 0 4 1 1665 GAS CRAWL SPACE GOOD 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 3 0 4 1 1975 GAS CRAWL SPACE
## AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 3
## 0 4 1 1984 GAS CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 1.5 STORY 3 0 4 1 2318 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 1.5 STORY 3 1 3 1 2573 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 1.5 STORY 4
## 0 4 1 2724 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 0 3 1 1504 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 0 3 1 1940 GAS CRAWL SPACE
## GOOD 1, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2 0
## 3 1 2026 GAS CRAWL SPACE AVERAGE 2, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 1856 GAS CRAWL SPACE GOOD 3, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 3 1 1928 GAS CRAWL SPACE
## GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2
## 1 3 1 2311 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 2329 GAS CRAWL SPACE GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 3 1 2418 GAS CRAWL SPACE
## GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 2 1 3 1 2436 GAS BASEMENT GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 2 1 3 1 2742 GAS CRAWL SPACE AVERAGE 1, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 2 1 4 1 2040 GAS CRAWL
## SPACE GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 2 1 4 1 2192 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 3 0 3 1 2042 GAS SLAB-RES GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 3 0 3 1 2690 GAS CRAWL SPACE
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 3 0 3 1 2717 GAS CRAWL SPACE AVERAGE 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 3 0 4 1 2496 GAS CRAWL SPACE GOOD 3, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 3 0 4 1 2708 GAS BASEMENT
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 3 0 4 1 3210 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.0 STORY 3 1 4 1 2164 GAS CRAWL SPACE AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.0 STORY 3 1 4 1 4002 GAS CRAWL SPACE
## VERY GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.0 STORY
## 4 0 4 1 4034 GAS CRAWL SPACE VERY GOOD 4, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.0 STORY 4 1 4 1 3844 GAS CRAWL SPACE GOOD 3, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.0 STORY 4 1 50 3 4016 GAS CRAWL
## SPACE AVERAGE 1, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5
## STORY 2 0 3 1 897 GAS CRAWL SPACE GOOD 1, Census Tract 10, Mecklenburg County,
## North Carolina 2020 2.5 STORY 2 1 3 1 1930 GAS BASEMENT AVERAGE 2, Census Tract
## 10, Mecklenburg County, North Carolina 2020 2.5 STORY 3 1 3 1 2082 GAS CRAWL
## SPACE GOOD 3, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5 STORY
## 3 1 4 1 2528 GAS CRAWL SPACE GOOD 3, Census Tract 10, Mecklenburg County, North
## Carolina 2020 2.5 STORY 3 1 4 1 2554 GAS SLAB-RES VERY GOOD 4, Census Tract 10,
## Mecklenburg County, North Carolina 2020 2.5 STORY 3 1 4 1 3205 GAS CRAWL SPACE
## GOOD 4, Census Tract 10, Mecklenburg County, North Carolina 2020 2.5 STORY 3 1
## 5 1 3690 GAS CRAWL SPACE VERY GOOD 4, Census Tract 10, Mecklenburg County, North
## Carolina 2020 RANCH W/BSMT 1 0 2 1 1188 GAS BASEMENT AVERAGE 1, Census Tract 10,
## Mecklenburg County, North Carolina 2020 RANCH W/BSMT 2 0 3 1 1
M.train.sf <- modelling_sub_200k[inTrain,] 
M.test.sf <- modelling_sub_200k[-inTrain,] 

M.train <- st_drop_geometry(M.train.sf) 
M.test <- st_drop_geometry(M.test.sf)


reg1 <- lm(price ~ ., data = M.train %>% 
                    dplyr::select(price,sale_year,storyheigh,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))

stargazer(reg1, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
## 
## Regression Results
## ==========================================================
##                                   Dependent variable:     
##                              -----------------------------
##                                          price            
## ----------------------------------------------------------
## sale_year                            72,727.750***        
##                                        (801.917)          
## storyheigh1 STORY                     39,632.020          
##                                      (73,299.900)         
## storyheigh1.5 STORY                   53,759.030          
##                                      (73,315.870)         
## storyheigh2.0 STORY                   25,202.240          
##                                      (73,285.760)         
## storyheigh2.5 STORY                    -306.798           
##                                      (73,372.490)         
## storyheigh3.0 STORY                   -36,354.800         
##                                      (74,874.900)         
## storyheighBI-LEVEL                     4,891.525          
##                                      (73,951.700)         
## storyheighCAPE COD                    48,841.770          
##                                      (78,865.910)         
## storyheighRANCH W/BSMT                20,904.500          
##                                      (73,544.120)         
## storyheighSPLIT LEVEL                  1,852.319          
##                                      (73,422.810)         
## fullbaths                            41,591.200***        
##                                       (1,379.076)         
## halfbaths                            21,505.890***        
##                                       (1,551.350)         
## bedrooms                             -7,214.694***        
##                                        (848.521)          
## units                                22,438.570***        
##                                       (3,294.094)         
## heatedarea                             87.409***          
##                                         (1.299)           
## heatedfuelGAS                        10,552.080***        
##                                       (1,890.182)         
## heatedfuelNONE                      -55,814.460***        
##                                      (15,729.220)         
## heatedfuelOIL/WD/COAL               -75,751.410***        
##                                       (8,445.143)         
## heatedfuelSOLAR/GEOTHRM               13,537.940          
##                                      (42,344.900)         
## foundationCRAWL SPACE                16,577.070***        
##                                       (4,726.464)         
## foundationPIER-NO FOUND WALL        233,805.300***        
##                                      (56,921.550)         
## foundationSLAB-ABV GRD               46,763.020**         
##                                      (19,306.840)         
## foundationSLAB-COM                    -29,284.830         
##                                      (56,917.750)         
## foundationSLAB-RES                     6,888.206          
##                                       (4,847.263)         
## bldggradeCUSTOM                     726,381.600***        
##                                      (14,195.270)         
## bldggradeEXCELLENT                  575,308.400***        
##                                       (5,507.844)         
## bldggradeFAIR                       -60,136.340***        
##                                       (5,104.902)         
## bldggradeGOOD                        84,045.900***        
##                                       (1,756.326)         
## bldggradeMINIMUM                    -88,184.070***        
##                                      (28,496.280)         
## bldggradeVERY GOOD                  265,346.600***        
##                                       (2,881.778)         
## shape_Leng                             68.910***          
##                                         (2.881)           
## year1                               879,996.100***        
##                                      (126,920.200)        
## year2                               824,306.400***        
##                                      (126,914.900)        
## year3                               829,367.800***        
##                                      (126,912.100)        
## year4                               837,279.800***        
##                                      (126,902.000)        
## college_nn1                            -2.414***          
##                                         (0.082)           
## church_nn1                             -4.628***          
##                                         (0.377)           
## park_nn2                               1.056***           
##                                         (0.268)           
## transit_nn3                            -2.115***          
##                                         (0.560)           
## pop                                    -3.748***          
##                                         (0.530)           
## med_hh_income                          0.797***           
##                                         (0.029)           
## belpov100                              33.513***          
##                                         (2.460)           
## pctvacant                           548,400.100***        
##                                      (15,452.610)         
## pctunemployed                         14,073.490          
##                                      (17,237.010)         
## pctgrad                             616,774.500***        
##                                      (16,319.520)         
## Constant                          -147,928,148.000***     
##                                     (1,627,206.000)       
## ----------------------------------------------------------
## Observations                            45,138            
## R2                                       0.770            
## Adjusted R2                              0.770            
## Residual Std. Error            126,743.500 (df = 45092)   
## F Statistic                  3,360.218*** (df = 45; 45092)
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

Floorstoryheight was removed to reduce errors and improve the model, and 23 predictor variables were included in the regression model training, including 10 categorical variables.

Second OLS Regression Model

reg2 <- lm(price ~ ., data = M.train %>% 
                    dplyr::select(price,sale_year,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))

Table of Results

stargazer(reg2, 
          type = "text",
          title="Regression Results",
          covariate.labels = c("Sale Year", "Number of Fullbaths", "Number of Halfbaths", "Number of Bedrooms", "Number of Units", "Heated Area", "Heated Fuel: GAS", "Heated Fuel: None", "Heated Fuel: OIL or WD or COAL", "Heated Fuel: SOLAR or GEOTHRM", "Foundation: CRAWL SPACE", "Foundation PIER-NO FOUND WALL", "Foundation SLAB-ABV GRD", "Foundation SLAB-COM", "Foundation SLAB-RES", "Building Grade: Custom", "Building Grade: EXCELLENT", "Building Grade: FAIR", "Building Grade: GOOD", "Building Grade: MINIMUM", "Building Grade: VERY GOOD", "Lengh of Shape", "Before 1979", "Between 1979 and 1999", "Between 1999 and 2015", "Between 2015 and 2022", "Average Distance Between Each House and 1 Nearest College", "Average Distance Between Each House 1 Nearest Church", "Average Distance Between Each House 2 Nearest Parks", "Average Distance Between Each House 3 Nearest Transit Stops", "Total Population 2020", "Median Household Income", "Number of Below 100% of the Poverty Level", "Percent of Vacant Housing Units","Percent of White People", "Percent of Unemployed Labor Force", "Percent of Bachelor's Degree", "Percent of Graduate"))
## 
## Regression Results
## =========================================================================================
##                                                                  Dependent variable:     
##                                                             -----------------------------
##                                                                         price            
## -----------------------------------------------------------------------------------------
## Sale Year                                                           72,638.790***        
##                                                                       (804.272)          
##                                                                                          
## Number of Fullbaths                                                 39,440.120***        
##                                                                      (1,362.375)         
##                                                                                          
## Number of Halfbaths                                                 15,231.520***        
##                                                                      (1,376.038)         
##                                                                                          
## Number of Bedrooms                                                  -8,642.240***        
##                                                                       (842.505)          
##                                                                                          
## Number of Units                                                     22,819.110***        
##                                                                      (3,302.940)         
##                                                                                          
## Heated Area                                                           85.443***          
##                                                                        (1.277)           
##                                                                                          
## Heated Fuel: GAS                                                    11,444.350***        
##                                                                      (1,891.558)         
##                                                                                          
## Heated Fuel: None                                                  -56,246.750***        
##                                                                     (15,772.990)         
##                                                                                          
## Heated Fuel: OIL or WD or COAL                                     -75,370.100***        
##                                                                      (8,465.220)         
##                                                                                          
## Heated Fuel: SOLAR or GEOTHRM                                        14,074.410          
##                                                                     (42,474.460)         
##                                                                                          
## Foundation: CRAWL SPACE                                             17,020.190***        
##                                                                      (4,645.492)         
##                                                                                          
## Foundation PIER-NO FOUND WALL                                      236,991.500***        
##                                                                     (57,089.400)         
##                                                                                          
## Foundation SLAB-ABV GRD                                             52,496.030***        
##                                                                     (19,338.010)         
##                                                                                          
## Foundation SLAB-COM                                                  -30,645.940         
##                                                                     (57,084.520)         
##                                                                                          
## Foundation SLAB-RES                                                  8,001.614*          
##                                                                      (4,764.142)         
##                                                                                          
## Building Grade: Custom                                             731,030.600***        
##                                                                     (14,209.490)         
##                                                                                          
## Building Grade: EXCELLENT                                          577,875.600***        
##                                                                      (5,480.953)         
##                                                                                          
## Building Grade: FAIR                                               -60,348.220***        
##                                                                      (5,113.535)         
##                                                                                          
## Building Grade: GOOD                                                85,761.580***        
##                                                                      (1,758.405)         
##                                                                                          
## Building Grade: MINIMUM                                            -88,463.310***        
##                                                                     (28,581.120)         
##                                                                                          
## Building Grade: VERY GOOD                                          267,551.100***        
##                                                                      (2,875.308)         
##                                                                                          
## Lengh of Shape                                                        72.398***          
##                                                                        (2.872)           
##                                                                                          
## Before 1979                                                        878,177.300***        
##                                                                     (127,307.500)        
##                                                                                          
## Between 1979 and 1999                                              822,189.000***        
##                                                                     (127,304.900)        
##                                                                                          
## Between 1999 and 2015                                              826,258.500***        
##                                                                     (127,302.900)        
##                                                                                          
## Between 2015 and 2022                                              835,121.200***        
##                                                                     (127,292.600)        
##                                                                                          
## Average Distance Between Each House and 1 Nearest College             -2.421***          
##                                                                        (0.082)           
##                                                                                          
## Average Distance Between Each House 1 Nearest Church                  -4.519***          
##                                                                        (0.378)           
##                                                                                          
## Average Distance Between Each House 2 Nearest Parks                   1.066***           
##                                                                        (0.269)           
##                                                                                          
## Average Distance Between Each House 3 Nearest Transit Stops           -2.001***          
##                                                                        (0.561)           
##                                                                                          
## Total Population 2020                                                 -3.649***          
##                                                                        (0.531)           
##                                                                                          
## Median Household Income                                               0.813***           
##                                                                        (0.029)           
##                                                                                          
## Number of Below 100% of the Poverty Level                             32.933***          
##                                                                        (2.465)           
##                                                                                          
## Percent of Vacant Housing Units                                    556,626.200***        
##                                                                     (15,484.520)         
##                                                                                          
## Percent of White People                                              11,668.010          
##                                                                     (17,285.210)         
##                                                                                          
## Percent of Unemployed Labor Force                                  617,399.100***        
##                                                                     (16,348.200)         
##                                                                                          
## Percent of Bachelor's Degree                                     -147,704,199.000***     
##                                                                    (1,630,538.000)       
##                                                                                          
## -----------------------------------------------------------------------------------------
## Observations                                                           45,138            
## R2                                                                      0.769            
## Adjusted R2                                                             0.769            
## Residual Std. Error                                           127,137.400 (df = 45101)   
## F Statistic                                                 4,166.285*** (df = 36; 45101)
## =========================================================================================
## Note:                                                         *p<0.1; **p<0.05; ***p<0.01

Calculate MAE and MAPE

alldata.sf <-
  alldata.sf %>%
  mutate(Price.Predict = predict(reg2, alldata.sf))
M.test.sf <-
  M.test.sf %>%
  mutate(Price.Predict = predict(reg2, M.test.sf),
         Price.Error = Price.Predict - price,
         Price.AbsError = abs(Price.Predict - price),
         Price.APE = (abs(Price.Predict - price)) / Price.Predict)
Test_MAE <- mean(M.test.sf$Price.AbsError, na.rm = T)
Test_MAPE <- mean(M.test.sf$Price.APE, na.rm = T)

M.train.sf <-
  M.train.sf %>%
  mutate(Price.Predict = predict(reg2, M.train.sf),
         Price.Error = Price.Predict - price,
         Price.AbsError = abs(Price.Predict - price),
         Price.APE = (abs(Price.Predict - price)) / Price.Predict)

Training_MAE <- mean(M.train.sf$Price.AbsError, na.rm = T)
Training_MAPE <- mean(M.train.sf$Price.APE, na.rm = T)

error <- data.frame(Training_MAE, Training_MAPE, Test_MAE, Test_MAPE)
error %>%
  kable() %>% 
  kable_material(bootstrap_options = "responsive",font_size = 20, full_width = F)
Training_MAE Training_MAPE Test_MAE Test_MAPE
80799.39 0.1909767 58297.31 0.1963437

Cross Validation

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = M.test %>% 
                                dplyr::select(price,sale_year,storyheigh,fullbaths,halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3), 
     method = "lm", trControl = fitControl, na.action = na.pass)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
reg.cv
## Linear Regression 
## 
## 426 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 422, 420, 422, 422, 423, 421, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   76670.93  0.7252262  57038.27
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Table of Cross Validation Results

We part the data frame into 100 equal sized subsets and train on a subset of observations, predict on a test set, and measure goodness of fit. The standard diviation of the MAE are different in different folders. This means this model can not predict precisely in different folds.

reg.cv$resample[,]
##          RMSE    Rsquared       MAE Resample
## 1    31720.19 0.716823881  23105.50  Fold001
## 2    95792.81 0.743850800  54471.97  Fold002
## 3   127132.11 0.725732912 113461.32  Fold003
## 4    28787.48 0.874503268  19252.47  Fold004
## 5    73002.65 0.920649278  60763.94  Fold005
## 6    29262.21 0.970050964  28883.55  Fold006
## 7    44318.88 0.138118822  35035.99  Fold007
## 8    28770.43 0.981474909  19497.86  Fold008
## 9    22001.01 0.823775149  19530.12  Fold009
## 10   93640.96 0.488169848  64640.38  Fold010
## 11   33455.20 0.689185678  26418.09  Fold011
## 12   38676.69 0.840749695  34065.85  Fold012
## 13   62131.37 0.837363384  51779.44  Fold013
## 14  107958.31 0.995160302  83562.77  Fold014
## 15   48022.29 0.540691472  46616.15  Fold015
## 16  205094.06 0.284738986 159478.98  Fold016
## 17   61822.83 0.703318526  41097.63  Fold017
## 18   98434.28 0.665433215  75094.06  Fold018
## 19   55140.45 0.498461038  34142.00  Fold019
## 20  100642.36 0.356964240  66682.26  Fold020
## 21  101753.72 0.999902172  77671.08  Fold021
## 22   98852.29 0.109443735  74915.34  Fold022
## 23   19400.34 0.980397786  19090.28  Fold023
## 24  100012.95 0.892138759  91723.82  Fold024
## 25   89692.01 0.974790826  63597.82  Fold025
## 26   64765.68 0.988495524  50776.06  Fold026
## 27   48615.56 0.644452128  38565.37  Fold027
## 28   37518.26 0.605148622  32775.49  Fold028
## 29   73561.57 0.968397426  53339.38  Fold029
## 30   91838.30 0.748856374  74768.08  Fold030
## 31   41349.85 0.969699377  32033.48  Fold031
## 32  104880.99 0.700745397  92866.77  Fold032
## 33   21207.95 0.964024820  16251.18  Fold033
## 34   56773.13 0.779004672  41976.51  Fold034
## 35   25238.45 0.866156306  21425.14  Fold035
## 36  168718.81 0.442833638  88722.54  Fold036
## 37  108513.25 0.928675506  81245.64  Fold037
## 38   33169.10 0.888432743  30910.43  Fold038
## 39  109312.36 0.574421411  77052.24  Fold039
## 40   89539.39 0.881172661  62088.45  Fold040
## 41   77211.91 0.890906580  52182.29  Fold041
## 42   32344.26 0.855199695  23991.48  Fold042
## 43   71919.38 0.951358655  66939.96  Fold043
## 44   54247.56 0.777301933  49663.94  Fold044
## 45  409923.32 0.934051062 227944.56  Fold045
## 46   35642.81 0.821403618  27543.54  Fold046
## 47   62871.52 0.323974709  57835.20  Fold047
## 48   35350.09 0.751281273  31077.08  Fold048
## 49   28940.82 0.840497656  25574.44  Fold049
## 50  125084.48 0.228620456  87509.80  Fold050
## 51  284118.37 0.225522061 216848.75  Fold051
## 52  175238.79 0.878704233 115726.52  Fold052
## 53   69645.45 0.034468479  58395.07  Fold053
## 54   51020.04 0.001612873  41583.90  Fold054
## 55   64249.77 0.703966598  55939.78  Fold055
## 56   41659.34 0.695942401  32757.02  Fold056
## 57   72368.45 0.503139869  57824.42  Fold057
## 58   32692.77 0.993588310  32272.98  Fold058
## 59   98379.15 0.982637063  71480.38  Fold059
## 60   28822.79 0.919063474  25337.20  Fold060
## 61   53744.32 0.864519196  38546.61  Fold061
## 62   64499.41 0.541037003  53211.92  Fold062
## 63  122451.85 0.002424432  94964.95  Fold063
## 64   21090.28 0.951791624  17969.37  Fold064
## 65  122510.85 0.098925991  72841.74  Fold065
## 66   35206.03 0.977037044  31282.31  Fold066
## 67   47405.77 0.900236607  39307.44  Fold067
## 68   76795.02 0.872472777  39933.90  Fold068
## 69   79177.79 0.985076003  49628.46  Fold069
## 70   71261.05 0.895193741  66422.25  Fold070
## 71   61490.26 0.878665107  58085.18  Fold071
## 72   34906.25 0.737038351  34676.91  Fold072
## 73   52900.43 0.903767098  43132.32  Fold073
## 74   97420.64 0.918958611  77129.94  Fold074
## 75   60123.46 0.013702482  52789.87  Fold075
## 76   38148.42 0.750728276  27896.18  Fold076
## 77   90195.57 0.018334994  80312.14  Fold077
## 78   43584.72 0.708801573  34791.87  Fold078
## 79   39910.75 0.623035790  31507.49  Fold079
## 80   28493.35 0.993220935  25206.21  Fold080
## 81   56170.69 0.561999105  37615.75  Fold081
## 82  385555.23 0.857323796 224856.52  Fold082
## 83   15025.70 0.996447845  14626.80  Fold083
## 84   45458.26 0.724718146  36277.23  Fold084
## 85  169200.94 0.759597897 140398.95  Fold085
## 86   47522.61 0.961810708  44525.66  Fold086
## 87   25009.11 0.836666296  19528.03  Fold087
## 88   98930.61 0.939890298  82404.34  Fold088
## 89   58191.38 0.708175848  40328.77  Fold089
## 90   29562.65 0.938892545  24319.10  Fold090
## 91  105673.10 0.300797239  61846.65  Fold091
## 92   26862.76 0.891887358  24848.83  Fold092
## 93   84145.24 0.553005571  56822.63  Fold093
## 94   78459.06 0.793118424  71164.53  Fold094
## 95   39206.20 0.795314712  30732.83  Fold095
## 96   39308.78 0.853618091  31129.53  Fold096
## 97  182175.24 0.951499697  98849.42  Fold097
## 98  105677.49 0.716735193  69069.90  Fold098
## 99   72009.30 0.880181377  51172.13  Fold099
## 100  39381.13 0.850328852  35868.41  Fold100
mean(reg.cv$resample[,3])
## [1] 57038.27
sd(reg.cv$resample[,3])
## [1] 39941.01

Distribution MAE

The distribution of MAE is not aggregated enough, which there are still many scattered distributions. We consider that the reasons for this distribution may be: 1. The presence of some extreme values 2. The presence of spatial correlation.

hist(reg.cv$resample[,3],
     breaks = 50,
     col = "#d5d5e6", 
     border = "#6A6ACD",
     xlab = "Mean Absolute Error",
     ylab = "Count",
     main = "Distribution of MAE\nK fold cross validation;k=100"
     )

Prediction Error

ggplot(M.test.sf, aes(x = Price.AbsError)) +
        geom_histogram(binwidth = 5000, color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
  scale_x_continuous(name = "Sale Price Absolute Error",
                     breaks = seq(0, 800000, 100000),
                     limits=c(0, 800000)) +
  scale_y_continuous(name = "Count") +
  ggtitle("Distribution of Price Prediciton Error") +
  theme_light()
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Maps

Test Set Predicted Price Map

ggplot() +
  geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
  geom_sf(data = M.test.sf, aes(colour = q5(price)), 
          show.legend = "Price.Error", size = 1, alpha = .5) +
  scale_colour_manual(values = palette5,
                      labels=qBr(M.test.sf,"price"),
                      name="Predicted Sale Price\nin US Dollars\n(Quintile Breaks)") +
  labs(title="Test Set Predicted Sale Prices") +
  theme(plot.title = element_text(hjust = 0.5,size = 20)) +
  mapTheme()

Distribution of MAE Errors

The MAE of sale price illustrates, prices clustering at different spatial scales.

ggplot() +
  geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
  geom_sf(data = M.test.sf, aes(colour = q5(Price.Error)), 
          show.legend = "Price.Error", size = 1, alpha = .5) +
  scale_colour_manual(values = palette5,
                      labels=qBr(M.test.sf,"Price.Error"),
                      name="The MAE of Sale Price\n(Quintile Breaks)") +
  labs(title="Test Set Absolute Sale Price Errors") +
  theme(plot.title = element_text(hjust = 0.5,size = 15))+
mapTheme()

Spatial Lag

Even if a regression perfectly accounts for internal characteristics, public services and amenities, it may still have errors that cluster in space. spatial autocorrelation is another essential thing we need to consider. Thus, we calculate the neighborhood and create its nearest neighbor by knn2nb.

modelling_sub_200k.sf <- modelling %>% 
filter(price <= 2000000)

coords <- st_coordinates(modelling_sub_200k.sf) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

modelling_sub_200k.sf$lagPrice<-
  lag.listw(spatialWeights,modelling_sub_200k.sf$price)

The correlation for the spatial lag relationship is 0.83 and is highly statistically significant, indicating clustering of house prices.

As prices rise, the prices of nearby houses also rise.

ggscatter(modelling_sub_200k.sf, x = "lagPrice", y = "price",
          add = "reg.line", 
          conf.int = TRUE,    
          add.params = list(fill = "lightgray", color = "#dea243"),
          size = .4, alpha = .1, color = "#2d2d42") +
  stat_cor(method = "pearson", 
           label.x = 3, label.y = 30) +
  labs(title = "Price as a function of the spatial lag") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

coords.test <- st_coordinates(M.test.sf)

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

M.test.sf$lagPriceError<-
  lag.listw(spatialWeights.test,M.test.sf$Price.Error)

The relationship describes a significant correlation of Error as a function of the spatial lag of price. The interpretation is that as home price errors increase, so does nearby home price errors. That model errors are spatially autocorrelated suggests that critical spatial information has been omitted from the model.

st_drop_geometry(M.test.sf)%>%
  ggplot(aes(lagPriceError,Price.Error)) +
  geom_point(size = 1,alpha = .4, color = "#2d2d42") + 
  geom_smooth(method = "lm", se=F, color = "#dea243") +
       labs(title = "Error as a function of the spatial lag of error") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Testing residual distribution

Reg_Dataframe <- cbind(reg2$residuals,reg2$fitted.values)
Reg_Dataframe <- as.data.frame(Reg_Dataframe)

colnames(Reg_Dataframe) <- c("residuals", "predictedValues")

ggplot(reg2, aes(Reg_Dataframe$residuals)) + 
  geom_histogram(bins=100, color = "#6A6ACD", fill = "#d5d5e6", size = 0.5) +
  labs(x="Residuals",
       y="Count")+
  theme_light()

ggplot(data = Reg_Dataframe, aes(x = residuals , y = predictedValues)) +
  geom_point(size = .3,alpha = .4, color = "#2d2d42") + 
  xlab("Residuals") + 
  ylab("Predicted Values") + 
  ggtitle("Residual Values vs. Predicted Values") +  
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_minimal()

Residuals Map

ggplot() +
  geom_sf(data = bd, fill = "grey90", color = "white", size = 0.4) +
  geom_sf(data = M.test.sf, aes(colour = q5(Price.Error)), 
          show.legend = "Price.Error", size = .9, alpha = .6) +
  scale_colour_manual(values = palette5,
                      labels=qBr(M.test.sf,"Price.Error"),
                      name="The Residuals Error\n(Quintile Breaks)") +
  labs(title="Residuals Distribution in Test Set") +
  theme(plot.title = element_text(hjust = 0.5,size = 15))+
mapTheme()

ggplot() +
  geom_sf(data = bd.sf, fill = "grey90", color = "white", size = 0.4) +
  geom_sf(data = M.test.sf, aes(colour = q5(lagPriceError)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(M.test.sf,"lagPriceError"),
                   name="Quintile\nBreaks") +
  labs(title="a map of spatial lag in errors") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

Moran’s I Test

The Moran’s I test results confirm our interpretation of the map. The Clustered point process yields a middling I of 0.31(Moran’s I value). But a p-value of 0.001 suggests that the observed point process is more clustered than all 999 random permutations (1 / 999 = 0.001) and is statistically significant.

moranTest <- moran.mc(M.test.sf$Price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01,fill = "#d5d5e6") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#dea243",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in Yellow",
       x="Moran's I",
       y="Count") +
  theme_minimal()
## Warning: Removed 2 rows containing missing values (geom_bar).

Neighborhood Model

In the previous analysis, it was concluded that the residuals have a strong spatial correlation. So we added a neighborhood indicator,hoping that the model would fit better.

But after our attempts to use the neiborhood boundary data failed, we switched to using the Name data from census data as an alternative.

left_join(
  st_drop_geometry(M.test.sf) %>%
    group_by(NAME) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
  mutate(M.test, predict.fe = 
                        predict(lm(price ~ NAME, data = M.test.sf), 
                        M.test)) %>%
    st_drop_geometry %>%
    group_by(NAME) %>%
      summarize(meanPrediction = mean(predict.fe))) %>%
      kable() %>% kable_minimal()
## Joining, by = "NAME"
NAME meanPrice meanPrediction
Census Tract 14, Mecklenburg County, North Carolina 273000.0 273000.0
Census Tract 15.04, Mecklenburg County, North Carolina 357000.0 357000.0
Census Tract 15.05, Mecklenburg County, North Carolina 271000.0 271000.0
Census Tract 15.08, Mecklenburg County, North Carolina 205000.0 205000.0
Census Tract 16.08, Mecklenburg County, North Carolina 305000.0 305000.0
Census Tract 19.10, Mecklenburg County, North Carolina 220000.0 220000.0
Census Tract 19.22, Mecklenburg County, North Carolina 239437.5 239437.5
Census Tract 19.23, Mecklenburg County, North Carolina 231333.3 231333.3
Census Tract 30.06, Mecklenburg County, North Carolina 615375.0 615375.0
Census Tract 30.08, Mecklenburg County, North Carolina 390500.0 390500.0
Census Tract 30.12, Mecklenburg County, North Carolina 1025000.0 1025000.0
Census Tract 30.19, Mecklenburg County, North Carolina 545900.0 545900.0
Census Tract 31.05, Mecklenburg County, North Carolina 505000.0 505000.0
Census Tract 31.10, Mecklenburg County, North Carolina 378000.0 378000.0
Census Tract 33.02, Mecklenburg County, North Carolina 963500.0 963500.0
Census Tract 36, Mecklenburg County, North Carolina 437000.0 437000.0
Census Tract 38.02, Mecklenburg County, North Carolina 290000.0 290000.0
Census Tract 40, Mecklenburg County, North Carolina 200500.0 200500.0
Census Tract 41.01, Mecklenburg County, North Carolina 585000.0 585000.0
Census Tract 43.02, Mecklenburg County, North Carolina 283500.0 283500.0
Census Tract 43.03, Mecklenburg County, North Carolina 195000.0 195000.0
Census Tract 43.07, Mecklenburg County, North Carolina 375000.0 375000.0
Census Tract 48, Mecklenburg County, North Carolina 350000.0 350000.0
Census Tract 53.05, Mecklenburg County, North Carolina 103000.0 103000.0
Census Tract 53.06, Mecklenburg County, North Carolina 247000.0 247000.0
Census Tract 53.07, Mecklenburg County, North Carolina 207500.0 207500.0
Census Tract 54.05, Mecklenburg County, North Carolina 226392.9 226392.9
Census Tract 54.06, Mecklenburg County, North Carolina 245750.0 245750.0
Census Tract 55.19, Mecklenburg County, North Carolina 385000.0 385000.0
Census Tract 55.20, Mecklenburg County, North Carolina 286000.0 286000.0
Census Tract 55.29, Mecklenburg County, North Carolina 338000.0 338000.0
Census Tract 55.30, Mecklenburg County, North Carolina 373333.3 373333.3
Census Tract 55.31, Mecklenburg County, North Carolina 385625.0 385625.0
Census Tract 55.35, Mecklenburg County, North Carolina 260000.0 260000.0
Census Tract 56.15, Mecklenburg County, North Carolina 361375.0 361375.0
Census Tract 56.17, Mecklenburg County, North Carolina 269088.2 269088.2
Census Tract 56.18, Mecklenburg County, North Carolina 436625.0 436625.0
Census Tract 56.19, Mecklenburg County, North Carolina 329583.3 329583.3
Census Tract 56.21, Mecklenburg County, North Carolina 369974.4 369974.4
Census Tract 56.26, Mecklenburg County, North Carolina 324121.6 324121.6
Census Tract 56.27, Mecklenburg County, North Carolina 255944.4 255944.4
Census Tract 57.10, Mecklenburg County, North Carolina 204000.0 204000.0
Census Tract 57.14, Mecklenburg County, North Carolina 504416.7 504416.7
Census Tract 57.16, Mecklenburg County, North Carolina 298083.3 298083.3
Census Tract 57.19, Mecklenburg County, North Carolina 349000.0 349000.0
Census Tract 57.21, Mecklenburg County, North Carolina 384400.0 384400.0
Census Tract 57.22, Mecklenburg County, North Carolina 358000.0 358000.0
Census Tract 58.26, Mecklenburg County, North Carolina 236500.0 236500.0
Census Tract 58.30, Mecklenburg County, North Carolina 308625.0 308625.0
Census Tract 58.50, Mecklenburg County, North Carolina 383000.0 383000.0
Census Tract 58.53, Mecklenburg County, North Carolina 357500.0 357500.0
Census Tract 58.55, Mecklenburg County, North Carolina 529000.0 529000.0
Census Tract 59.13, Mecklenburg County, North Carolina 402000.0 402000.0
Census Tract 59.18, Mecklenburg County, North Carolina 315277.8 315277.8
Census Tract 59.19, Mecklenburg County, North Carolina 393555.6 393555.6
Census Tract 59.20, Mecklenburg County, North Carolina 810400.0 810400.0
Census Tract 59.21, Mecklenburg County, North Carolina 441000.0 441000.0
Census Tract 59.22, Mecklenburg County, North Carolina 415000.0 415000.0
Census Tract 59.24, Mecklenburg County, North Carolina 323000.0 323000.0
Census Tract 59.25, Mecklenburg County, North Carolina 309500.0 309500.0
Census Tract 59.26, Mecklenburg County, North Carolina 462666.7 462666.7
Census Tract 60.05, Mecklenburg County, North Carolina 352357.1 352357.1
Census Tract 60.08, Mecklenburg County, North Carolina 280333.3 280333.3
Census Tract 60.09, Mecklenburg County, North Carolina 302750.0 302750.0
Census Tract 60.11, Mecklenburg County, North Carolina 237857.1 237857.1
Census Tract 60.12, Mecklenburg County, North Carolina 264500.0 264500.0
Census Tract 60.13, Mecklenburg County, North Carolina 250500.0 250500.0
Census Tract 60.14, Mecklenburg County, North Carolina 377000.0 377000.0
Census Tract 60.15, Mecklenburg County, North Carolina 244333.3 244333.3
Census Tract 60.16, Mecklenburg County, North Carolina 252250.0 252250.0
Census Tract 61.03, Mecklenburg County, North Carolina 708500.0 708500.0
Census Tract 61.08, Mecklenburg County, North Carolina 212500.0 212500.0
Census Tract 61.10, Mecklenburg County, North Carolina 283666.7 283666.7
Census Tract 61.11, Mecklenburg County, North Carolina 271500.0 271500.0
Census Tract 61.14, Mecklenburg County, North Carolina 277666.7 277666.7
Census Tract 62.23, Mecklenburg County, North Carolina 340000.0 340000.0
Census Tract 63.06, Mecklenburg County, North Carolina 346100.0 346100.0
Census Tract 63.09, Mecklenburg County, North Carolina 445000.0 445000.0
Census Tract 63.10, Mecklenburg County, North Carolina 738666.7 738666.7
Census Tract 63.11, Mecklenburg County, North Carolina 515000.0 515000.0
Census Tract 64.04, Mecklenburg County, North Carolina 694000.0 694000.0
Census Tract 9, Mecklenburg County, North Carolina 980000.0 980000.0

Re-estimating the regression with the neighborhood Name feature.

reg.nhood <- lm(price ~ ., data = as.data.frame(M.train.sf) %>%
                  dplyr::select(NAME,price,sale_year,fullbaths, halfbaths,bedrooms,units,heatedarea,heatedfuel,foundation,bldggrade,shape_Leng,year,college_nn1,church_nn1,park_nn2,transit_nn3,pop,med_hh_income,belpov100,pctvacant,pctunemployed,pctgrad))
M.test.nhood <-
  M.test.sf %>%
  mutate(Regression = "Neighborhood Effects",
         Price.Predict = predict(reg.nhood, M.test.sf),
         Price.Error = Price.Predict- price,
         Price.AbsError = abs(Price.Predict- price),
         Price.APE = (abs(Price.Predict- price)) / price) %>%
  filter(price < 2000000)
## Warning in predict.lm(reg.nhood, M.test.sf): prediction from a rank-deficient
## fit may be misleading

Prediction Challenge Dataset

challenge <-
  challenge %>%
  mutate(Price.Predict = predict(reg.nhood, challenge))   
## Warning in predict.lm(reg.nhood, challenge): prediction from a rank-deficient
## fit may be misleading

Table of goodness of fit (test set)

The R square of our new neighborhood model nearly in the range of 0.82-0.84. We are satisfied with the result.

stargazer(reg.nhood, 
          type = "text",
          title="Test Set Regression Results",
          keep=c("Constant "))
## 
## Test Set Regression Results
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                                price            
## ================================================
## Observations                   45,138           
## R2                             0.835            
## Adjusted R2                    0.834            
## Residual Std. Error   107,803.500 (df = 44815)  
## F Statistic         703.485*** (df = 322; 44815)
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01
#write.csv(challenge,"/Users/paprika/Desktop/Courses/MUSA 508 Public Policy/Midterm/predictions.csv", row.names = FALSE)

The AbsError and APE all decrease, which means the Neighborhood Effects model is more accurate on both a dollars and percentage basis.

M.test.sf$Regression <- "Baseline Regression"
bothRegressions <- 
  rbind(
    dplyr::select(M.test.sf, starts_with("price"), Regression, NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error)),
    dplyr::select(M.test.nhood, starts_with("price"), Regression, NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error)))  

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -NAME) %>%
  filter(Variable == "Price.AbsError" | Variable == "Price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable() %>% kable_minimal()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
Regression Price.AbsError Price.APE
Baseline Regression 58297.31 0.1963437
Neighborhood Effects 44578.81 0.1354487

Predicted prices are plotted as a function of observed prices. Recall the purple line represents a would-be perfect fit, while the yellow line represents the predicted fit.

bothRegressions %>%
  dplyr::select(Price.Predict, price, Regression) %>%
    ggplot(aes(price, Price.Predict)) +
  geom_point(size = .5,alpha = .3, colour = "grey15") +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#5252B8") + 
  stat_smooth(aes(Price.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#D0A561") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price", size = 15,
       subtitle="Purple line represents a perfect prediction \nYellow line represents prediction") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Spatial Autocorrelation

mm_table <- left_join(
  st_drop_geometry(M.test.sf) %>%
    group_by(NAME) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
  
 st_drop_geometry(M.test.sf) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(Price.APE, na.rm = T)))
## Joining, by = "NAME"

There is no strong relationship between MAPE and average housing price, when MAPE is measured in blocks.

mm_table <- merge(acsTracts20, 
                  mm_table,by.x = 'NAME',
                  by.y = 'NAME') 

mm_table <- mm_table[,c('NAME','MAPE','meanPrice')]

ggplot(data=st_drop_geometry(mm_table), 
       aes(meanPrice, MAPE)) +
  geom_point(size = .8,alpha = .7, colour = "grey15") +
  labs(title = 'MAPE as a function of mean price in each census tract') +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_minimal()

MAPE Map

ggplot() +
  geom_sf(data = bd, fill = "grey90", color = "white", size = 0.3) +
  geom_sf(data = mm_table, color = "white",size = 0.1, aes(fill = MAPE)) +
  scale_fill_gradient(low = "#5252B8", 
                      high = "#D0A561",
                      name = "Quintile\nBreaks") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(title = "MAPE of Each Census Tract") +
  mapTheme()

Generalizability

The fit in the two partitions is basically the same, and the ideal reason is because the fit of our model effect is better.

We are not sure whether it is an ideal model, because we have calculated the Census data factors in regression model. Maybe it is due to these factors in census data has the same spatial lag. We’re still debating whether using census data for spatial generalizability test is a good choose.

The fit is relatively better in high-income areas, and perhaps not friendly to low-income groups if such a model is used for taxation and housing stock prices.

ggplot() + 
  geom_sf(data = na.omit(acsTracts20),
          aes(fill = raceContext), 
          color = "white", 
          size = 0.4
          ) +
    scale_fill_manual(values = c("#F3CD88", "#6A6ACD"),
                      name="Race Context") +
    labs(title = "Percent of the White People") +
  mapTheme()

bothRegressions <-
  bothRegressions  %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102179')

acsTracts20 <-
  acsTracts20 %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102179')
  
st_join(bothRegressions, acsTracts20) %>% 
  filter(!is.na(raceContext)) %>%
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(Price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context") %>% kable_minimal()
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
Test set MAPE by neighborhood income context
Regression Majority Non-White Majority White
Baseline Regression 21% 16%
Neighborhood Effects 14% 12%

Discussion

We were satisfied with our operation of converting factor types (converting a lot of numerical data to categlorical data) and the final model showed that they served as a good fit. But there are still many interesting variables that we have not yet had time to consider. For example, we did not find a suitable data source for the crime rate. Some of the data are only available as geometry data divided by administrative areas. We did not find a good way to import them into the model.

Unfortunately, we did not have more time to steptest our model, so the prediction model still contains 23 independent variables.

According to the MAPE and mean house price plots, there is no strong association between our MAPE and mean house price, indicating that our neighborhood factor has an effect.

Conclusion

We would recommend this model. first we calculated our APE in the test set to be around 0.16, which is an acceptable margin of error.

When we compare the baseline model with the neighborhood model, we find that the regression curve becomes very well fitted after adding the spatial factor. the value of r-squared also remains stable at around 0.83.

And in terms of generalizability, our model has relatively strong generalizability across different kinds of partitions. However, it is worth mentioning that our model seems to fit better for high-income communities. The reason may be due to the larger amount of data in high-income communities or the richer facilities in high-income communities.

There are still some limitations in our model. If we need to create a super model that is broadly applicable to many types of communities, firstly, we are going to find the best predictive features or variables. Secondly, we have to try to inject enough predictive power into the model to make good predictions without over-fitting.